Sparse neural networks have garnered attention due to their theoretical promise of reduced computational cost and memory footprint. To date, however, these gains have largely failed to materialize due to the lack of hardware support for this kind of model. In this work, we explore the idea of neuron recycling, inspired by pruning, a method often employed to induce sparsity in neural networks. We also present lessons we have learned along the way.
Introduction
Pruning is a well-established technique for sparsifying neural networks. It relies on the observation that a large fraction of a trained network can typically be masked without hurting accuracy, although additional fine-tuning is often required to regain some of the lost performance. Despite the many neuron-selection criteria proposed in the literature, magnitude-based pruning remains a viable option.
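To make this concrete, here is a minimal sketch of magnitude-based pruning in PyTorch. It is illustrative rather than a reference implementation; the function name and the sparsity level are our own choices.

import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Return a binary mask that zeroes out the smallest-magnitude weights."""
    k = int(sparsity * weight.numel())
    if k == 0:
        return torch.ones_like(weight)
    # The k-th smallest absolute value acts as the pruning threshold.
    threshold = weight.abs().flatten().kthvalue(k).values
    return (weight.abs() > threshold).float()

# Example: mask 90% of the weights of a linear layer.
layer = torch.nn.Linear(512, 512)
mask = magnitude_prune(layer.weight.data, sparsity=0.9)
layer.weight.data *= mask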
Understanding neuron magnitudes
Since we already knew that we wanted to take neuron magnitudes into consideration, one of the first natural questions we asked ourselves was how this metric behaves and evolves during training.
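For reference, this is what we mean by tracking the metric: treating each output unit of a layer as a neuron and taking the L2 norm of its incoming weight row as its magnitude. A toy sketch with a dummy objective, assuming this definition:

import torch
import torch.nn as nn

layer = nn.Linear(256, 1024)           # stand-in for one layer of the model
optimizer = torch.optim.AdamW(layer.parameters(), lr=1e-3)

history = []                           # snapshots of per-neuron magnitudes
for step in range(1_000):
    x = torch.randn(32, 256)
    loss = layer(x).pow(2).mean()      # dummy objective, for illustration only
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 100 == 0:
        # Magnitude of neuron i = L2 norm of row i of the weight matrix.
        history.append(layer.weight.detach().norm(dim=1).cpu())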
What we observed is that the distribution is heavily skewed: most neurons settle at very small magnitudes, while only a few grow large. This discovery raised several questions. Are the majority of neurons with very small magnitudes insignificant? Can they be dropped? Or could it be that these “small” neurons are activated rarely, but are crucial for certain tasks? Alternatively, perhaps when combined, they contribute significantly to the overall performance? Although we initially expected a different distribution, we wanted to explore and find the “ideal” distribution of neurons.
As a side note, we noticed, interestingly, that this discrepancy does not occur in the last layers of the network. As an example, below you can examine the magnitudes of neurons in the 8th FF layer of the network.
[plot - magnitudes in the 8th FF layer]
In order to find the ideal neuron distribution, we decided to give special attention to the feed-forward (FF) component of the network.
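In an FF block, the natural unit is a hidden neuron, which owns one row of the input projection and one column of the output projection. Below is one plausible way to define its magnitude; combining both projections is our assumption here, not necessarily the exact definition used in every experiment.

import torch
import torch.nn as nn

class FFBlock(nn.Module):
    """Standard transformer feed-forward block: w_out(act(w_in(x)))."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_ff)
        self.w_out = nn.Linear(d_ff, d_model)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_out(self.act(self.w_in(x)))

def ff_neuron_magnitudes(ff: FFBlock) -> torch.Tensor:
    # Hidden neuron i owns row i of w_in and column i of w_out;
    # we take the norm of their concatenation as its magnitude.
    incoming = ff.w_in.weight.norm(dim=1)   # shape: (d_ff,)
    outgoing = ff.w_out.weight.norm(dim=0)  # shape: (d_ff,)
    return torch.sqrt(incoming**2 + outgoing**2)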
We also examined what happens when we retrain only small-magnitude neurons, only large-magnitude neurons, or random subsets: how does the choice of subset affect performance? The results are depicted in the following plot, and a sketch of the masking setup follows it.
[plot will be attached here]
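One way to realize such selective retraining is to zero out the gradients of every neuron outside the chosen subset. The sketch below is hypothetical: the hook-based masking and the 10% fraction are illustrative choices, and the bias is left unmasked for brevity.

import torch
import torch.nn as nn

def restrict_training_to(linear: nn.Linear, neuron_idx: torch.Tensor) -> None:
    """Zero the weight gradients of every neuron outside `neuron_idx`."""
    keep = torch.zeros(linear.out_features, dtype=torch.bool)
    keep[neuron_idx] = True

    def mask_grad(grad: torch.Tensor) -> torch.Tensor:
        # grad has shape (out_features, in_features); broadcast over columns.
        return grad * keep.to(grad.device).unsqueeze(1)

    linear.weight.register_hook(mask_grad)

# Example: retrain only the 10% smallest-magnitude neurons of a layer.
layer = nn.Linear(256, 1024)
magnitudes = layer.weight.norm(dim=1)
k = int(0.1 * layer.out_features)
smallest = magnitudes.topk(k, largest=False).indices
restrict_training_to(layer, smallest)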
Interestingly, retraining only the smallest neurons yields the best results when compared to reinforcing high-magnitude neurons or random subsets. This provided a compelling argument in favor of our technique. However, it is important to note that these experiments were based on pretraining relatively small, BERT-based models. We were curious to see how our observations would translate to well-established, large-scale foundation models like BERT, T5, and GPT-2.
[plot - magnitudes in foundation models]
Upon examining the magnitudes in these foundation models, noticeable differences emerge. The magnitudes in T5 seem similar to those in our smaller models, while BERT and GPT-2 display more favorable distributions. What could account for these variations? We discovered that the use of weight decay plays a significant role: this simple but widely used technique has a considerable impact on the distribution phenomenon we have been investigating. This is consistent with the training recipes, since BERT and GPT-2 were pretrained with weight decay while, to our knowledge, T5 was not.
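For context, decoupled weight decay, as in AdamW, shrinks every decayed weight multiplicatively at each step, which pulls rarely useful neurons toward zero. A standard PyTorch setup is shown below, with the common practice of exempting biases and normalization parameters; the exact recipes used to train the models above differ.

import torch

def build_optimizer(model: torch.nn.Module, lr: float = 1e-4,
                    weight_decay: float = 0.01) -> torch.optim.AdamW:
    # Common practice: decay weight matrices, but not biases or norm scales.
    decay = [p for p in model.parameters() if p.ndim >= 2]
    no_decay = [p for p in model.parameters() if p.ndim < 2]
    return torch.optim.AdamW(
        [{"params": decay, "weight_decay": weight_decay},
         {"params": no_decay, "weight_decay": 0.0}],
        lr=lr,
    )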
These findings support the idea of exploring neuron recycling more thoroughly and offer a solid foundation for further experiments. In subsequent sections, we will delve into the results of these investigations and share our insights.
Recycling
The central part of our work was a method we called neuron recycling. The whole process boils down to three phases, repeated periodically: training, selection, and reinitialization. A high-level sketch of one possible loop follows.
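The sketch below reuses the FFBlock from earlier and spells out one plausible instantiation: selection picks the lowest-magnitude hidden neurons, and reinitialization gives them fresh incoming weights while zeroing their outgoing weights so that the function the network currently computes is not disturbed. These concrete choices are assumptions made for illustration, not a definitive description of the method.

import torch
import torch.nn as nn

@torch.no_grad()
def recycle(ff: "FFBlock", fraction: float = 0.1) -> None:
    """One selection + reinitialization pass over an FF block (sketch)."""
    # Selection: pick the lowest-magnitude hidden neurons.
    magnitudes = ff.w_in.weight.norm(dim=1)
    k = int(fraction * magnitudes.numel())
    idx = magnitudes.topk(k, largest=False).indices
    # Reinitialization: fresh random incoming weights ...
    fresh = torch.empty(k, ff.w_in.weight.shape[1])
    nn.init.kaiming_uniform_(fresh)
    ff.w_in.weight[idx] = fresh
    ff.w_in.bias[idx] = 0.0
    # ... and zeroed outgoing weights, so recycled neurons start out
    # without changing the network's current output.
    ff.w_out.weight[:, idx] = 0.0

# Repeated periodically (hypothetical training driver):
# for phase in range(num_phases):
#     train(model, steps=steps_per_phase)
#     for block in model.ff_blocks:
#         recycle(block, fraction=0.1)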